Though far from a perfect assessment of academic preparedness, SAT scores are often used as one measurement of a state’s education system. The education data stored at https://www.macalester.edu/~ajohns24/data/sat.csv contain various education variables for each state:
education <- read.csv("https://www.macalester.edu/~ajohns24/data/sat.csv")
| State | expend | ratio | salary | frac | verbal | math | sat | fracCat |
|---|---|---|---|---|---|---|---|---|
| Alabama | 4.405 | 17.2 | 31.144 | 8 | 491 | 538 | 1029 | (0,15] |
| Alaska | 8.963 | 17.6 | 47.951 | 47 | 445 | 489 | 934 | (45,100] |
| Arizona | 4.778 | 19.3 | 32.175 | 27 | 448 | 496 | 944 | (15,45] |
| Arkansas | 4.459 | 17.1 | 28.934 | 6 | 482 | 523 | 1005 | (0,15] |
| California | 4.992 | 24.0 | 41.078 | 45 | 417 | 485 | 902 | (15,45] |
| Colorado | 5.443 | 18.4 | 34.571 | 29 | 462 | 518 | 980 | (15,45] |
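The six rows shown above are enough for a quick consistency check. The sketch below rebuilds them by hand (the name `sample_rows` is hypothetical), confirms that sat is the sum of verbal and math, and re-derives fracCat from frac. The break points 0, 15, 45, and 100 are an assumption inferred from the interval labels in the table, not taken from the codebook.

```r
# Six rows copied from the table above; sample_rows is a hypothetical name
sample_rows <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  frac   = c(8, 47, 27, 6, 45, 29),
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Check that the combined score is the sum of the two section scores
sat_matches <- sample_rows$verbal + sample_rows$math == sample_rows$sat

# Rebuild the category from frac; break points are assumed from the labels
frac_cat <- cut(sample_rows$frac, breaks = c(0, 15, 45, 100))

print(all(sat_matches))
print(data.frame(State = sample_rows$State, frac = sample_rows$frac, fracCat = frac_cat))
```

On the full data, the same two checks can be run on education directly.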
A codebook is provided by Danny Kaplan, who also made these data accessible:
Figure 1.1: Codebook for SAT data. Source: https://www.macalester.edu/~kaplan/ISM/datasets/data-documentation.pdf
To examine the variability in average SAT scores from state to state, let’s start with a univariate density plot:
library(ggplot2) # provides ggplot() and the geom_*() layers
ggplot(education, aes(x=sat)) +
  geom_density(fill="blue", alpha=.5)
The first question we’d like to answer is: to what degree do per pupil spending (expend) and teacher salary (salary) explain this variability? We can start by plotting each against sat, along with a best fit linear regression model:
ggplot(education, aes(y=sat, x=salary)) +
  geom_point() +
  geom_smooth(se=FALSE, method="lm")
ggplot(education, aes(y=sat, x=expend)) +
  geom_point() +
  geom_smooth(se=FALSE, method="lm")
Exercise 1.1 Is there anything that surprises you in the above plots? What are the relationship trends?
It is surprising that SAT scores trend downward as teacher salary and per pupil expenditure increase, judging by the fitted lines.
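The direction of the fitted lines can also be read off the regression slopes. This is a minimal sketch using only the six rows shown in the data table (`sample_rows` and `slopes` are hypothetical names), so the magnitudes will differ from a fit on all states; only the sign is of interest here.

```r
# Six rows copied from the data table; not the full data set
sample_rows <- data.frame(
  expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Fit one simple linear regression per predictor and keep the slope
slopes <- c(
  salary = unname(coef(lm(sat ~ salary, data = sample_rows))["salary"]),
  expend = unname(coef(lm(sat ~ expend, data = sample_rows))["expend"])
)

print(slopes)
```

Replacing `sample_rows` with education gives the slopes of the lines drawn by geom_smooth(method="lm").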
Exercise 1.2 Make a single scatterplot visualization that demonstrates the relationship between sat, salary, and expend. Summarize the trivariate relationship between sat, salary, and expend. Hints: 1. Try using the color or size aesthetics to incorporate the expenditure data. 2. Include some model smooths with geom_smooth() to help highlight the trends.
ggplot(education, aes(y=sat, x=salary, color=expend)) +
  geom_point() +
  geom_smooth()
Exercise 1.3 The fracCat variable in the education data categorizes the fraction of the state’s students that take the SAT into low (at most 15%), medium (above 15% and at most 45%), and high (above 45%).
1. Visualize the fracCat variable to better understand how many states fall into each category.
2. Visualize the relationship between fracCat and sat. What story does your graphic tell?
3. Visualize the relationship between fracCat, sat, and expend. Incorporate fracCat as the color of each point, and use a single call to geom_smooth to add three trendlines (one for each fracCat). What story does your graphic tell?
ggplot(education, aes(x=fracCat)) +
  geom_bar(position="dodge")
ggplot(education, aes(y=sat, x=fracCat)) +
  geom_boxplot()
These plots show that states where a higher percentage of students take the SAT tend to score worse as a whole. States with 45-100% of their students taking the test have the lowest median score, below 900, while states with 0-15% have a median of around 1030.
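The counts behind the bar chart and the medians behind the boxplot can be tabulated directly. The sketch below uses only the six rows from the data table, so its numbers are not the full-data medians quoted above; `sample_rows`, `counts`, and `medians` are hypothetical names, and the break points passed to cut() are assumed from the fracCat labels.

```r
# Six rows copied from the data table; not the full data set
sample_rows <- data.frame(
  frac = c(8, 47, 27, 6, 45, 29),
  sat  = c(1029, 934, 944, 1005, 902, 980)
)

# Recreate the participation category (break points assumed from the labels)
sample_rows$fracCat <- cut(sample_rows$frac, breaks = c(0, 15, 45, 100))

# Number of states per category, as in the bar chart
counts <- table(sample_rows$fracCat)

# Median SAT score per category, as in the boxplot center lines
medians <- tapply(sample_rows$sat, sample_rows$fracCat, median)

print(counts)
print(medians)
```

With the full data, `table(education$fracCat)` and `tapply(education$sat, education$fracCat, median)` give the corresponding values.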
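Note that each variable (column) is scaled, so the colors rank states (rows) from high to low values within that variable. With the cm.colors palette, high values are pink and low values are blue; the code below uses heat.colors(256) instead, which runs from red (low) to pale yellow (high). With this in mind, you can scan across rows and across columns to visually assess which states and variables are related, respectively. You can also play with the color scheme. Type ?cm.colors in the console to see various options.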
library(gplots) # provides heatmap.2()
ed <- as.data.frame(education) # only needed for a tibble; read.csv() already returns a data frame
row.names(ed) <- ed$State
ed <- ed[, 2:8] # keep the numeric columns expend through sat
ed_mat <- data.matrix(ed)
heatmap.2(ed_mat, Rowv=NA, Colv=NA, scale="column",
  keysize=.7, density.info="none",
  col=heat.colors(256), margins=c(10,20),
  colsep=c(1:7), rowsep=(1:50), sepwidth=c(0.05,0.05),
  sepcolor="white", cexRow=2, cexCol=2, trace="none",
  dendrogram="none")
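The scale="column" argument means each variable is centered and scaled within its own column before colors are assigned, so variables on very different scales (expend near 5, sat near 1000) become comparable. A rough base R sketch of that standardization, assuming it corresponds to what scale() does, on the six rows from the data table (`sample_rows`, `m`, and `z` are hypothetical names):

```r
# Six rows copied from the data table; not the full data set
sample_rows <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  ratio  = c(17.2, 17.6, 19.3, 17.1, 24.0, 18.4),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  frac   = c(8, 47, 27, 6, 45, 29),
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Numeric matrix with states as row names, mirroring ed_mat
m <- as.matrix(sample_rows[, -1])
rownames(m) <- sample_rows$State

# Center each column to mean 0 and scale it to standard deviation 1
z <- scale(m)

print(round(z, 2))
```

Positive entries mark states above the column mean and negative entries mark states below it, which is the ordering the heat map colors encode.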
Heat map with row clusters
It can be tough to identify interesting patterns by visually comparing across rows and columns. Including dendrograms helps to identify interesting clusters.
heatmap.2(ed_mat, Colv=NA, scale="column", keysize=.7,
  density.info="none", col=heat.colors(256),
  margins=c(10,20),
  colsep=c(1:7), rowsep=(1:50), sepwidth=c(0.05,0.05),
  sepcolor="white", cexRow=2, cexCol=2, trace="none",
  dendrogram="row")
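A row dendrogram is the result of hierarchical clustering of the states. The sketch below shows the idea in base R on the six rows from the data table, assuming Euclidean distance and complete linkage (the defaults of dist() and hclust()). Standardizing the columns first with scale() is a choice made here so that no single variable dominates the distances; it may not match what heatmap.2 does internally. The names `m`, `hc`, and `groups` are hypothetical.

```r
# Six rows copied from the data table; not the full data set
m <- cbind(
  expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  ratio  = c(17.2, 17.6, 19.3, 17.1, 24.0, 18.4),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  frac   = c(8, 47, 27, 6, 45, 29),
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)
rownames(m) <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado")

# Pairwise Euclidean distances between states on standardized columns
d <- dist(scale(m))

# Agglomerative clustering with complete linkage (the hclust() default)
hc <- hclust(d)

# Cut the tree into two groups of states
groups <- cutree(hc, k = 2)

print(groups)
```

Calling `plot(hc)` draws the tree on its own, which can be easier to read than the version squeezed beside the heat map.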
Heat map with column clusters
We can also construct a heat map which identifies interesting clusters of columns (variables).
heatmap.2(ed_mat, Rowv=NA, scale="column", keysize=.7,
  density.info="none", col=heat.colors(256),
  margins=c(10,20),
  colsep=c(1:7), rowsep=(1:50), sepwidth=c(0.05,0.05),
  sepcolor="white", cexRow=2, cexCol=2, trace="none",
  dendrogram="column")
There’s more than one way to visualize multivariate patterns. Like heat maps, these star plot visualizations indicate the relative scale of each variable for each state. With this in mind, you can use the star plots to identify which state is the most “unusual.” You can also do a quick scan of the second image to try to cluster states. How does that clustering compare to the one generated in the heat map with row clusters above?
stars(ed_mat, flip.labels=FALSE,
  key.loc=c(15,1.5), cex=1.5)
stars(ed_mat, flip.labels=FALSE,
  key.loc=c(15,1.5), cex=1.5, draw.segments=TRUE)
I think the star plots make the clusters slightly easier to read, because the lines in the heat map cross a lot and I find the paths of some of them hard to follow.
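By default, stars() rescales each column so that its minimum maps to 0 and its maximum maps to 1, which is why the length of each spoke reflects a state's relative standing on that variable. A minimal sketch of that rescaling on the six rows from the data table (`m`, `to_unit`, and `unit` are hypothetical names):

```r
# Six rows copied from the data table; not the full data set
m <- cbind(
  expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  ratio  = c(17.2, 17.6, 19.3, 17.1, 24.0, 18.4),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  frac   = c(8, 47, 27, 6, 45, 29),
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)
states <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado")

# Map a numeric vector onto [0, 1]: its minimum becomes 0, its maximum 1
to_unit <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply the rescaling to every column, then restore the row and column names
unit <- apply(m, 2, to_unit)
dimnames(unit) <- list(states, colnames(m))

print(round(unit, 2))
```

A state whose row is mostly near 0 or mostly near 1 is drawn as an unusually small or large star among these rows.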